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Summary. 

While graphical models for continuous data (Gaussian graphical models) and discrete data (Ising models) have 

been extensively studied, there is little work on graphical models linking both continuous and discrete variables 

(mixed data), which are common in many scientific applications. We propose a novel graphical model for mixed 

data, which is simple enough to be suitable for high-dimensional data, yet flexible enough to represent all possible 

graph structures. We develop a computationally efficient regression-based algorithm for fitting the model by 

focusing on the conditional log-likelihood of each variable given the rest. The parameters have a natural group 

structure, and sparsity in the fitted graph is attained by incorporating a group lasso penalty, approximated by 

a weighted £i penalty for computational efficiency. We demonstrate the effectiveness of our method through 

^~~' an extensive simulation study and apply it to a music annotation data set (CAL500), obtaining a sparse and 

00 

psj interpretable graphical model relating the continuous features of the audio signal to categorical variables such as 

■^ genre, emotions, and usage associated with particular songs. While we focus on binary discrete variables, we 

o 

^^ also show that the proposed methodology can be easily extended to general discrete variables. 

^ — I 

>• Key Words: Conditional Gaussian density, Graphical model, Group lasso. Mixed variables, 

^ Music annotation. 

1. Introduction 

Graphical models have proven to be a useful tool in representing the conditional dependency 
structure of multivariate distributions. The undirected graphical model in particular, some- 
times also referred to as the Markov network, has drawn a lot of attention over the past 
decade. In an undirected graphical model, nodes in the graph represent the variables, while 
an edge between a pair of variables indicates that they are dependent conditional on all other 
variables. The vast majority of the graphical models literature has been focusing on either 



the multivariate Gaussian model (Meinshausen and Biihlmann 2006 Yuan and Lin, 2007 
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Rothman et al] |2008] [d'Aspremont et al.j |2008[ |Rocha et al] |2008t |Ravikumar et aL| |2008 



Lam and Fan, 2009; Peng et al. 2009 Yuan, 2010; Cai et al. 2011 Friedman et al. 2008), 



or the Ising model for binary and discrete data (Hofling and Tibshirani 2009 Ravikumar 



et al. , 2010 Guo et al. , 2010). The properties of these models are by now well understood 



and studied both in the classical and the high- dimensional settings. Both these models only 
deal with variables of one kind - either all continuous variables in the Gaussian model or all 
binary variables in the Ising model (extensions of the Ising model to general discrete data, 
while possible in principle, are rarely used in practice). In many applications, however, data 
sources are complex and varied, and frequently result in mixed types of data, with both 
continuous and discrete variables present in the same dataset. In this paper, we will focus 
on graphical models for this type of mixed data (mixed graphical models). 



The conditional Gaussian distribution was originally proposed (Lauritzen and Wermuth 



1989 Lauritzen, 1996) to model mixed data and has become the foundation of most de- 



velopments on this topic. In the original paper, Lauritzen and Wermuth (1989) define a 
general form of the conditional Gaussian density and characterize the connection between 
the model parameters and the conditional associations among the variables. The model is 
fitted by maximum likelihood, but the number of parameters in this model, however, grows 
exponentially with the number of variables, which renders this approach unsuitable for high- 
dimensional problems arising in many modern applications. Much more recently, Lee and| 



Hastie (2012) and Fellinghauer et al. (2011) have studied the mixed graphical model (si- 



multaneously and independently of the present work), under a setting that could be viewed 
as a simplified special case of our proposal. A more detailed discussion of these papers is 
postponed to Section [6] 

In this paper, we propose a simplified version of the conditional Gaussian distribution 
which reduces the number of parameters significantly yet maintains flexibility. To flt the 
model in a high- dimensional setting, we impose a sparsity assumption on the underlying 
graph structure and develop a node-based regression approach with the group lasso penalty 



(Yuan and Lin, 2006), since edges in the mixed graphical model are associated with groups of 
parameters. The group lasso penalty in itself is not computationally efficient, and we develop 
a much faster weighted ii approximation to the group penalty which is of independent 
interest. The simulation results show promising model selection performance in terms of 
estimating the true graph structure under high-dimensional settings. 

To start with, we give a brief introduction to the conditional Gaussian distribution and 



its Markov properties following Lauritzen (1996). 

Conditional Gaussian (CG) density: let X = {Z, Y) be a mixed random vector, 
where Z = {Zj)j^/^ is a g-dimensional discrete sub- vector, and Y = (Ky)^gr is a p-dimensional 
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continuous sub-vector. The conditional Gaussian density /(x) is defined as 

/(x) = f{z,y) = exp (g, + h^y - -y^K.y 

where {{gz, hz, Kz),gz G M, /iz G M^, Kz G Mpxp; ^ ^ Range(Z)} are the canonical parameters 
of the distribution. The following equations connect the canonical parameters in ([l]) to the 
moments (P^, C,z, S^): 

Pz = P{Z = z) = {2T,Y'\det{Kz))-''^ exp ((?, + h^K:'hz/2) , 

iz = e{y\z = z) = k;'k, 

E, = Var(r|Z = z) = K~\ 

and the conditional distribution of Y given Z = ^ is 7V(^2, T,z). 

The next theorem relates the graphical Markov property of the model to its canonical 
parameters and serves as the backbone of the subsequent analysis. 

Theorem 1. Represent the canonical parameters from ([I]) by the following expansions, 

9z= Yl ^'^('^)' ^^= Yl ^'^('^)' ^^= Yl '^^('^) ' (2) 

d-.dCA d-.dCA d.dCA 

where functions indexed by d only depend on z through Zd- Then a CG distribution is Marko- 
vian with respect to a graph Q if and only if the density has an expansion that satisfies 

\d{z) = unless d is complete in Q, 
rij{z) = unless dU {7} is complete in Q, 
^'Y{z) = unless dU {7,/i} is complete in Q. 

where rp^iz) is the 'j-th element ofr]d{z), $^^(2;) is the '~ffi-th element of^d{z), and a subgraph 
is called complete if it is fully connected. 

The rest of the paper is organized as follows. Section |2] introduces the simplified mixed 
graphical model which has just enough parameters to cover all possible graph structures, 
proposes an efficient estimation algorithm for the model, and discusses theoretical guarantees 
for the proposed method under the high-dimensional setting. Section [3] uses several sets of 
simulation studies to evaluate the model selection performance and compare some alternative 
choices of regularization. In Section [4], the proposed model is applied to a music annotation 
data set CAL500 with binary labels and continuous audio features. In Section [5| we describe 
the generalization of the model from binary to discrete variables. Finally, we conclude with 
discussion in Section [6l 
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2. Methodology 

Our main contribution is a simplified but flexible conditional Gaussian model for mixed 
data. Model fitting is based on maximizing the conditional log-likelihood of each variable 
given the rest for computational tractability. This leads to penalized regression problems 
with an overlapping group structure of the parameters, the natural solution to which is to 
fit separate regressions with an overlapping group lasso penalty. This is computationally 
quite expensive, so we approximate the overlapping group lasso penalty by an appropriately 
rescaled £i penalty. 

2. 1 . The simplified mixed grapiiical model 

Recall we partition the random vector X = {Zi, Z2, . . . , Zg, Yi, Y2, . . . , Yp) into the binary 
part Zj G {0, 1}, j = 1, . . . ,g, and the continuous Y^ G M, 7 = 1, . . . ,p. We propose to 
consider the conditional Gaussian distribution with the density function 



log f{z,y) = Yl Mz)+ Yl ^'i(^fy-2 Yl y^'^<i{z)y 

d:dCA,\d\<2 d:dCA,|d|<l d:dCA,\d\<l 

\ j j>k J \ j J \ j=l J 

^o + J2 ^^^3 + Y ^jkZjZk j +J2[^o^Yl ^J^j' ) y^ 

\ j j>k / 7=1 \ j / 

-III U' + i'^r^^y^y^^ (3) 

7,/i=l \ j=l / 



Y det(i^,)-^/2 exp I Y ^i^i + Y ^i^ZjZk + h^K-^hj2 j 
e{o,i}9 V j j>k J 



where {diag($j)}^^]^ = {$J'^; j = 1, . . . , g, 7 = 1, . . . ,p} are all and Aq is the normalizing 
constant, 

Ao = I (27r)P/2 

Note that the density is explicitly defined via the expanded terms in (pi) but the canoni- 
cal parameters {gz,hz,Kz) can be obtained immediately by summing up the corresponding 
terms. This model simplifies the full conditional Gaussian distribution ([I]) in two ways: 
first, it omits all interaction terms between the binary variables of order higher than two, 
and second, it models the conditional covariance matrix and the canonical mean vector of 
the Gaussian variables as a linear function of the binary variables instead of allowing their 
dependence on higher order interactions of the binary variables. These simplifications reduce 
the total number of parameters from 0{p'^2^P'^'^^) in the full model to O (max(g^,p^g)). On 
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the other hand, this model is the simplest CG density among those allowing for varying 
conditional covariance Var(y|Z) that can represent all possible graph structures, since it 
includes interactions between all the continuous and discrete variables and thus allows for a 
fully connected graph, an empty graph, and everything in between. The fact that it allows 
both the conditional mean and the conditional covariance of Y given Z to depend on Z adds 
flexibility. 

2.2. Parameter estimation 

Given sample data {(zj, j/j)}"^]^, directly maximizing the log-likelihood XliLi ^^S/l-^i' ?/i) 
is impractical due to the normalizing constant Aq- The conditional likelihood of one vari- 
able given the rest, however, is of much simpler form and easy to maximize. Hence, we 
focus on the conditional log-likelihood of each variable and fit separate regressions to es- 
timate the parameters, much in the spirit of the neighborhood selection approach pro- 



posed by Meinshausen and Biihlmann (2006) for the Gaussian graphical model and by 



Ravikumar et al. (2010) for the Ising model. To specify the conditional distributions, let 
Z_j = (Zi, . . . , Zj-i, Zj^i, . . . , Zq) and Y^^ = (Yi, . . . , Ky_i, l^+i, . . . , Yp). Then the condi- 
tional distribution of Zj given {Z_j, Y) is described by 

Since the conditional log-odds in ^ is linear in parameters, maximizing this conditional 
log-likelihood can be done via fitting a logistic regression with {Z^j, Y, Y"^) as predictors and 
Zj as response. 

For the continuous variables, the conditional distribution of Y^ given {Y_^, Z) is given by 

n = ^ (^0^ + E ^J^.- - E f^o'^ + E ^r ^.) ^m) + ^- 

^ V j ii+1 V i / / 

where e^ ~ Ar(0, {KJ^Y^). With diag($j) = as defined by (|3|), we have KJ< = $2;^, i.e., 
the conditional variance of Ky does not depend on Z. Rewrite 



n = v2 + y: v]z, - E f ^r + E ^rz) 

3 M7^7 \ 3 / 



Y^ + e^ , (5) 



where the redefined parameters with "tilde" are proportional to the original ones up to the 
same constant for each regression. Again, the conditional mean of Y^ is linear in parameters, 
which can be estimated via ordinary linear regression with predictors (VI7, Z^ Y^^Z) and 
response Y^. 
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2.3. Regularization 

Based on Theorem [l} the following equivalences hold: 
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(6) 

This means that each edge between pairs of {Zj,Y^) and {Y^,Y^) depends on a parameter 
vector, denoted by 6j^/ and 6^^, respectively. To encourage sparsity of the edge set under 
high-dimensional settings, a natural choice here would be to use a group lasso penalty, such 
as the ii\i2 penalty proposed by Yuan and Lin (2006) for group lasso. The groups are pre- 
determined by parameter vectors corresponding to each edge. Denoting the loss function for 
the logistic regression of Zj by £j and the linear regression for Y^ by i^, we have 

1 "" 

^j = X1^°S(^(% I (2*,(-j),?/i)), 

^ i=i 

-, n q q 

i=l 3=1 /i^7 j=l 

We estimate the parameters by optimizing the following criteria separately, for j = 1, . . . ,q 

and 7 = 1, ... ,7 

Logistic regression: min ij + p / 11-^^^111+ / ll^jTlb 1 , (7) 
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Linear regression: min i^ + p 

where p is the tuning parameter. We use the same tuning parameter for all regressions 
to simplify tuning, but they can also be tuned separately if the computational cost is not 
prohibitive. Another reason to use a single tuning parameter is to simplify the treatment of 
overlapping groups of parameters from different regressions (see more on this below). Note 
that in linear regression, the parameters in (Is]) denoted with "tilde" are proportional to the 
original parameters. The original parameters can be recovered by multiplying the estimates 
by {K]'^)~^, which can be estimated from the mean squared error of the linear regression. 

Although the optimization problems ([T]) and ([s]) each have the form of regular group lasso 
regressions, they can not be jointly solved by existing group lasso algorithms, because the 
groups of parameters involved in each regression overlap. Specifically, in logistic regression, 
the parameter $J^ is part of both 0j^ and Oj^ and determines both the edges {Zj,Yy) and 
{Zj, Y^); thus 6j.y has one parameter overlapping with each of the other ^j^^'s. Similarly, in 
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linear regression, $J'^ is part of both 6j^ and 6^^, and affects both the edges {Zj,Y^) and 
(F^, Y^). This overlapping pattern creates additional difficulties in using the group penalty to 
perform edge selection. The overlapping group lasso problem has received limited attention 



from a computational point of view. Yuan et al. (2011) recently proposed an algorithm for 



solving the overlapping group lasso problem, but it is computationally intensive for high- 
dimensional data. Instead of optimizing the overlapping group penalty directly, we look for 
a surrogate penalty with similar properties that is easier to optimize and thus more suitable 
for a high-dimensional setting. 

The weighted £i penalty can provide an upper bound for the group penalty, since for any 
vector b, ||b||2 < ||^||i- For the logistic regression ([7|, for example, we have 



7M| 



E II viii + E ii^^^ih < E i^^-^i + E i^Ji + 2 E 1^] 

kj^j 7=1 kjf^j 7=1 7</^ 

The surrogate on the right penalizes the overlapping parameters twice as much as the other 
parameters, which makes intuitive sense since incorrectly estimating the overlapping param- 
eters as non-zero will add two wrong edges, while incorrectly estimating unique parameters 
for each group as non-zero will only add one wrong edge. 




Fig. 1: Green (outside): {b : ^Jhl + b^ + ^b^ + b^ = 1}; Blue (mside): {6 : |&i| + I&3I + 2|62| = 1} 



To further illustrate why this upper bound provides a good approximation to the feasible 
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region, consider the following toy example. Let the parameter vector be b = (61, 62, ^3), with 
groups Qi = (61, 62) and Q2 = (^2? ^3)- The optimization problems for the overlapping group 
lasso penalty and its ii surrogate boil down to optimizing the same loss function on different 
feasible regions (for the same tuning parameter). Figure [I] compares the two feasible regions, 
7^l = {b : ^/bj + hi + ^Jl)i + ^ < 1} and 7^2 = {b : |&i| + M + 2\h2\ < 1}. Since both the 
logistic loss and the least squares loss are smooth convex functions, the solutions will lie at 
singular points of the feasible regions IZi and IZ2. Note that IZ2 is not only a subset of TZi 
but it contains all four singular points of TZi. (±1,0,0), (0,0, ±1), which are also singular 
points in 7^2- Thus in this example, the optimum will be chosen from exactly the same set 
of singular points regardless of the penalty used. For higher dimensional 6, it still holds that 
7^2 is a subset of T^i, but it may not contain all of the singular points of TZi (this depends 
on the group structure). While that means that we may not have exactly the same optimal 
points, the approximation could still be good enough to identify the same non-zero groups, 
and this is what we found in practice. 

With the group penalty replaced by the weighted di surrogate, we solve the following 
regression problems separately as an approximation to the original problems Q and rtsl) to 
obtain the parameter estimates. 
Logistic regression with ii penalty: for j = 1, . . . , g 

min £, + p r£ |A,.| + J2 W,\ + 2 "£ \^f\\ . (9) 

VMj 7=1 7<A' / 

Linear regression with di penalty: for 7 = 1, . . . ,|) 

min d, + p\i2 I^JI + E l^ol + 2 E E \^T\] ■ (10) 

Since we are estimating the parameters in separate regressions, all parameters determining 
edges will be estimated at least twice. This problem is common to all neighborhood selection 
approaches based on separate regressions, and is usually solved by taking either the largest or 
the smallest (in absolute value) of the estimates. For the mixed Gaussian model, we found 
that taking the maximum of absolute values results in somewhat better model selection, 
and that is what we use throughout the paper as the final estimate. To fit both types of 



regressions with a weighted £1 penalty, we use the matlab package glranet of Friedman et al. 



(2010). 



2.4. Asymptotic Properties 

Model selection consistency in high-dimensional regression problems is equivalent to esti- 
mating the non-zero parameters as non-zero. Correctly identifying the sign of non-zero 
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parameters (in addition to correctly identifying them as non-zero) is referred to as sign con- 
sistency, while the usual £2 convergence of all the estimated parameters is referred to as norm 
consistency. Regression with li penalty has been shown to possess both sign consistency and 
norm consistency under certain regularity conditions, of which the most important one is 



the irrepresentable condition (see Meinshausen and Biihlmann (2006); Zhao and Yu (2006); 



Van de Geer and Biihlmann (2009) for linear regression and Ravikumar et al. (2010); Van de 
Geer (2008) for logistic regression). To show consistency of our method, we only need to 
require the main assumptions of the existing results to hold on a rescaled version of the 
original design matrix for each regression. Since we fit weighted £i-penalized regressions 
with fixed pre-determined weights, we can treat the parameters multiplied by the weights as 
new parameters, and rescale the design matrix for each regression by dividing each column 
the corresponding weights. This converts the problem to a standard £i-penalized regression, 
and consistency is thus guaranteed by existing results cited above. 

3. Numerical performance evaluation 

This section includes simulation studies on evaluating model selection performance under 
different settings and comparing alternative choices of penalties. The results are summarized 
in ROC curves, where we plot the true positive count (TP) against false positive count (FP) 
and the true positive rate (TPR) against the false positive rate (FPR), for both parameters 
and edges over a fine grid of tuning parameters. Let and denote the true parameter vector 
and the fitted parameter vector respectively (without the intercept terms in the regressions) , 
the parameter based quantities are defined as follows, 

TP = #{j -Jj^Q and Oj i^ 0}, FP = #{j J^ 7^ and Qj = 0}, 

TP FP 

TPR = -—^— — -^, FPR 



The quantities based on the true edge set E and the estimated edge set JE are defined in a 
similar fashion, with parameter sets replaced by edge sets. 



3.1. Model selection performance 

In the first simulation, we fix the variable dimensions to be p = 90 continuous variables and 
g = 10 discrete variables, with the sample size n = 100. We vary the maximum node degree 
(i.e., the number of edges from the node) in the graph while maintaining the total number 
of edges fixed at 80. This results in graph structures with varying "uniformity", because 
given the total number of edges in a graph, the smaller the maximum node degree, the more 
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Fig. 2: Upper left: percentage-based ROC curves for parameter identification; upper right: count-based ROC curves 
for parameter identification; lower left: percentage-based ROC curves for edge identification; lower right: count-based 
ROC curves for edge identification. The maximum node degree varies in {2, 6, 10}, and the total number of edges is 
fixed at 80. The variable dimensions are p = 90, g = 10; the sample size is n = 100; the curves are averaged over 20 
replications. 

uniform the graph is. A chain graph, for example, is the most extreme case of a uniform 
graph. 

The data are generated as follows. First we generate the underlying graph structure given 
a maximum node degree: for the case where the maximum node degree is 2, we use a chain 
graph of the first 81 nodes; when the maximum node degree is 6, we generate a random 
graph from the Erdos-Renyi model. To enforce the maximum node degree constraint, we 
simply regenerate the adjacency matrix if there are any degrees larger than 6, which does 
not require many attempts since the graphs are very sparse under this setting. For the 
maximum node degree of 10, we assign 10 edges to the first node and randomly generate 
the other 70 edges from the Erdos-Renyi model, thus creating a sparse graph with a hub. 
The first q nodes in the adjacency matrix are taken to correspond to the binary variables. 
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Given the true graph, we set all parameters corresponding to absent edges to be 0. For the 
other (non-zero) parameters, we set the {Xj, \jk,rij} to be 1 or —1 with equal probability, 
and the off-diagonal elements of {$0)"^^} to 2 or —2 with equal probability. The diagonal 
elements of $o are chosen so that $0 + X]j=i ^j^j i^ positive definite for all possible z^s. We 
then generate the discrete variables Zj's based on 2"^ probabilities given by P^ in (pi). Since 
we use the exact discrete probabilities rather than MCMC methods to generate the binary 
data, the storage requirements prohibits taking a very large q in simulations; however, this 
does not affect real data, and the method works well with large q in Section |4} Finally, for 
each Zi we generate the continuous variables j/j from a multivariate Gaussian distribution 
with mean ^^^ and covariance S^. defined by (pi). 

The results in Figure [2] contain four ROC curves for both parameters and edges across 
a fine grid of the tuning parameters p, recorded both in percentage terms (FPR and TPR) 
and also in terms of counts (TP and FP), to get a better sense of the results on the real scale 
of the problem. The cut-off point for the parameter rate-based FPR is chosen at the point 
after which the curve does not change much, and the range of count-based FP is chosen to 
approximately match the range of FPR. The results show that as the maximum node degree 
increases, the model selection performance deteriorates, even though the total number of 
edges remains fixed. 

Next, we vary the degree of sparsity in the true graphs. The variable dimensions are again 
fixed at p = 90, q = 10, and the sample size at n = 100. The number of edges is set to either 
60, 80, or 100, while the degree of all nodes is at most 3. The true graph is again generated 
from the Erdos-Renyi model with the specified number of edges. The results are shown in 
Figure |3| Even though the underlying graphs are getting more dense as the number of edges 
increases, as long as the maximum degree is controlled, the percentage-based ROC curves 
for both edges and parameters stay nearly the same. The count-based ROC curves are not 
directly comparable in this case due to the varying number of parameters and edges of the 
true model. 

3.2. Comparison with alternative penalized regressions 

In this simulation, we compare our proposed method (denoted by weighted £1) with two other 

penalized regression approaches using the edge ROC curves. The first alternative (denoted by 



simple ii), discussed by Fellinghauer et al. (2011 ), is to fit separate £1 regularized regressions 



by regressing each variable on the others, without including any of the interaction terms. 
The second alternative (denoted by Li_regular) we consider is to replace the penalty in our 
method with the regular ii penalty, ignoring the grouping patterns and the overlaps between 
groups. We consider two settings relevant for this comparison: true graphs with and without 
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Fig. 3: Upper left: percentage-based ROC curves for parameter identification; upper right: count-based ROC curves 
for parameter identification; lower left: percentage-based ROC curves for edge identification; lower right: count-based 
ROC curves for edge identification. The number of edges varies in {60,80,100}, and the maximum node degree is 
at most 3. The variable dimensions are p = 90, q = 10; the sample size is n = 100; the curves are averaged over 20 
replications. 

complete subgraphs. 



3.2.1. Graphs without complete subgraphs 

When the underlying graph does not contain any complete subgraphs, which can easily 
happen if it is very sparse, all the overlapping parameters <I>J^ are zero. From (|6|, we can see 
that each edge is then represented by a unique parameter: the edge corresponding to {Zj, Y^) 
is determined by rij; the edge for {Zj,Zk) is determined by Xjk] and the edge for (Y^f,Yfj_) 
is determined by ^q^^. Then all the interaction terms in regressions Q and (tsl) vanish. We 
follow the set-up of the first simulation, where the maximum node degree takes values in 
{2, 6, 10}, the total number of edges is fixed at 80, the dimensions are p = 90,q = 10, and 
n = 100. If a true graph we generate contains a complete subgraph, we discard it and 
generate a new one. Figure |4] shows that in each subplot, simple ii and weighted ii perform 
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similarly, and they both outperform the regular ii. The results are to be expected, because 
for a true model with no interaction terms the simple ii, which excludes interaction terms 
automatically, should perform the best. Our weighted ii approach penalizes the interaction 
terms twice as much as the other parameters, which allows it to achieve a similarly good 
performance. The generic regular ii penalty fails to capture the group structure among the 
parameters and performs noticeably worse than the other two methods. We can conclude 
from this that if the underlying model is believed to be very sparse, simple ii does well by 
not including the interaction terms; our method weighted ii does equally well even with the 
ineffective interactions. The generic regular ii, which ignores the group structure, is the 
least accurate of the three alternatives. 
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Fig. 4: Edge-based ROC curves for sparse graphs without complete subgraphs. Blue solid: weighted (.i; red dash-dot: 
simple i?i; blue dash: regular l\. The maximum node degree is 2 (left), 6 (middle), or 10 (right), the total number of 
edges is 80. The variable dimensions are p — 90, q — 10; sample size is n = 100; the curves are averaged over 20 data 
replications. 

3.2.2. Graphs with complete subgraphs 

In this part, we study the case where the true graph contains fully connected dense subgraphs. 
Specifically, we set p = 40 and q = 10, make the first 20 nodes completely connected, and the 
other nodes are not connected to anything. This gives approximately 190 edges and 650 non- 
zero parameters. To obtain comparable signal-to-noise ratios and comparable ROC curves 
in this setting, we adjust the sample size to n = 200 and decrease the non-zero parameter 
values by a factor of 10. For graphs with complete subgraphs, there are non-zero parameters 
^J'^ in overlapping groups, which implies that the true model includes some interaction terms 
in Q and ^. 

Further, we consider two types of true models compatible with the graph structure above 
for regressions: in model I, both main and interaction effects in Q and (tsl) are non-zero, 
and model II has main effects only and all the interaction effects are zero. Figure |5] shows 
the results. When interaction terms are present in the true model, the regular ii penalty 
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and our weighed penalty perform similarly, and better than the simple ii penalty, as one 
would expect. When only main effects are present in the true model, all three methods show 
similar performance with the weighted and simple £i approaches having a slight advantage 
for reasons similar to those in Section 13.2. 1[ 



Model I: main & interaction effects 




Model II: main effects only 



0.2 0.3 

False Positive Rate (FPR) 




0.2 0.3 0.4 

Raise Positive Rate (FPR) 



Fig. 5: Edge-based ROC curves for graphs with complete subgraphs. Blue solid: Li .weighted; red dash-dot: 
I/i_simple; blue dash: I/i_regular. Left: both main and interaction effects present, right: main effects only. The 
variable dimensions are p — 40, q = f 0; sample size is n = 200; the curves are averaged over 20 data replications. 

Taken together, the simulation results indicate that the weighted penalty approach does 
a good job balancing the need for penalizing interactions more than the main effects because 
of double-counting, and the need to keep them in the model. If we have no prior information 
on whether to expect complete subgraphs in the network or not, the weighted ii penalty 
appears to be the safest choice. 



4. Application to music annotation data 

Music annotation has been studied by researchers in many areas, including audio signal 
processing, information retrieval, multi-label classification, and others. Music annotation 
data sets usually consist of two parts: "labels", typically assigned by human experts, contain 
the categorical semantic descriptions of the piece of music (emotions, genre, vocal type, etc.); 
and "features" , continuous variables extracted from the time series of the audio signal itself 
using well developed signal processing methods. Representing these mixed variables by 
a Markov graph would allow us to understand how these different types of variables are 
associated with each other. For example, one can ask which rhythm and timbre features are 
associated with certain genres. We apply our method to the publicly available music data 



set CAL500 (TurnbuU et al. 2008), from the Mulan database (Tsoumakas et al., 2011), to 



find the conditional dependence patterns among the mixed variables. 

The CAL500 dataset consists of 502 popular western music tracks (including both English 



Mixed Graphical Models 1 5 
language songs and instrumental music) composed within the last 55 years by 502 different 
artists. The collection covers a large range of acoustic variations and music genres, and the 
labeling of each song is supervised by at least three individuals. For each song, the label 
part includes a semantic vocabulary of 149 tags represented by a 149-dimensional binary 
vector indicating the presence of each annotation. Specifically, the labels are partitioned 
into the following six categories: emotions (36 total), genres (31), instruments (24), song 
characteristics (27), usages (15), and vocal types (16). The continuous features of the music 
are based on the short time Fourier transform (STFT) and are calculated for each short 
time window by sliding a half-overlapping, 23ms time window over the song's digital audio 



file. Detailed description of the feature extraction procedure can be found in Tzanetakis 



and Cook (2002). For each analysis window of 23ms, the following continuous features 



are extracted to represent the audio file: the spectral centroid, a measure of 'brightness' of 
the music texture with higher value indicating brighter music with more high frequencies; 
spectral flux, a measure of the amount of local spectral change; zero crossings, a measure of 



the noisiness of the signal; and the first MFCC coefficient (Logan, 2000) representing the 
amplitude of the music, which comes from a two-step transformation designed to capture 
the spectral structure. Every consecutive 512 of the 23ms short frames are then grouped 
into Is long texture windows, based on which the following summary statistics for the four 
features defined above were calculated and used as the final continuous part of the data: 
overall mean, mean of the standard deviations of each texture window, standard deviation 
of the means of each texture window, and standard deviation of the standard deviations of 
each texture window. 

In our analysis, we omitted labels which were assigned to less than 3% of the songs and 
kept only the first MFCC coefficient since it can be interpreted as the overall amplitude of 
the audio signal, and other coefficients are not readily interpretable. Also, we standardized 
the continuous variables. This resulted in a dataset with n = 502 observations, g = 118 
discrete variables, and p = 16 continuous variables. 



We applied our method coupled with stability selection (Meinshausen and Biihlmann 



2010) to identify the underlying Markov graph for the purposes of exploratory data analysis. 



which is the primary usage of graphical models. To perform stability selection, we run our 
algorithm 100 times on randomly drawn sub-samples of size n/2, and kept only the edges that 
were selected at least 90 times. The results are shown in Figure |6} The continuous timbre 
features are represented by squares labeled 1-16 and the binary variables are represented by 
circles labeled 1-118. Each color represents a category of variables as shown in the legend. 
The graph has a number of interesting and intuitive connections. The continuous variables 
that represent the audio signal features are quite densely connected within themselves, which 
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Fig. 6: Estimated graphical model for CAL500 music data (edges with stability selection frequency of at least 0.9). 



is to be expected. More interesting edges can be found between continuous features and 
expert labels. The average amplitude of the music (square 4) is connected with the genre 
"alternative rock" (circle 37) and the instrument "Synthesizer" (circle 72). The noisiness 
of the music (square 13) is associated with "negative feelings" (circle 87), which seems 
reasonable and has also been reported by [Blumstein et aL (2012). We also find connections 
between short period amplitude variation (square 8) with popular likeable songs (circle 84) 
and not tender or soft emotions (circle 34). Moreover, there are some interesting patterns 
within the group of binary labels, which allow us to infer connections between different 
emotions, genres, instruments, usages and so on. For example, songs with positive feelings 
(circle 86) are connected to piano (circle 67), happy and not happy emotions (circle 17 and 
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18, highly negatively correlated), songs with high energy (circle 82) and optimistic emotions 
(circle 27). We also find edges connecting fast and not fast tempo music (circles 78 and 79) 
to classic rock (circle 38), songs with high energy (circle 82) and very danceable and not 
danceable songs (circles 97 and 98). Likeable or popular songs (circle 84) are associated with 
usages such as driving (circle 101) and reading (circle 107), which makes intuitive sense. 

5. Extension to general discrete data 

To extend our model to the general case where the discrete variables can take more than 
two values, we modify the previous model ^ into the following, 

log f{z,y) = Yl Mz)+ Yl Vd{zfy-2 Yl y^'^d{z)y , 

d:dCA,\d\<2 d:dCA,\d\<l d:dCA,\d\<l 

^o + Yl ^j (^j) + Yl ^.^'fc^^j' ^fc) 1 + 5Z I ^d + Zl ^J^^j) ) y^ ' 

j=l j>k / 7=1 \ j=l / 



7,^=1 \ j=l / 



^7?/^ ' (11) 



where each Zj takes integer values 1 to Kj; \j{-), v]{')j ^]'^{') ^^^ s-H discrete functions 
which take on Kj possible values and Xjk{-, ■) is a discrete function with Kj x Kk values. For 
identifiability, we set Xj{l) = 0, r/J(l) = 0, ^]^{1) = and Xjk{l,-) = >^jk{-A) = 0. The 
correspondence between the parameters and the edges is then given by 

Zj J- Zk I X\{Zj, Zk} <^=^ 6jk = {\jk{zj,Zk)) = , 

Z,±K,| X\{Z,,Y,} ^^ e,,= {v]iz,),{^Tiz,):f,eT\{^}})=0, 

Y,±Y,\ X\{Y„Y,} ^^ 0,, = ($2M$f (^,) : J G A}) = . (12) 

The generalized model can be fitted with separate regressions based on the conditional like- 



lihood of each variable. The parameters in (12) still have group structure, which calls for 



using the group lasso penalty as in ([T]) and (JSJ). The overlapping structure is more complex 



in this case, and we use the upper bound ii approximation as in rt9j) and ( 10 ) to obtain the 
final estimates. Specifically, we minimize the following criteria separately: 

Logistic regression with ii penalty: for j = 1, . . . , g 



mm tj 



+p(E E iA,.(.„..)i + EEi^J(^^-)i+2EEi^r 

fey^i (zj.Zfc) 7=12:^=1 ^.^ti Zj = l 
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Linear regression with ii penalty: for 7 

mm t^ + p\ ^^Wj{zj) 



1, 



>P 



A', 



. i=i ^j 







6. Discussion 



We have proposed a new graphical model for mixed (continuous and discrete) data, which 
is particularly suitable for high-dimensional data. While the general conditional Gaussian 



model goes back to Lauritzen and Wermuth ( 1989 ), it is not appropriate for high-dimensional 



data, and there is little previous work on mixed graphical models that can scale to modern 
applications. While our model is substantially simpler than the original general model, it 
scales much better. Given that graphical models are primarily a tool for exploratory data 
analysis, this is a reasonable trade-off, and the ability to explore conditional dependence 
relationships between large numbers of discrete and continuous variables will hopefully be 
of use to practitioners in a range of application domains. 

We have recently become aware of two new developments on this topic (which were derived 



in parallel with and independently of this manuscript). Lee and Hastie (2012) assume a 
more restricted version of the conditional Gaussian density by assuming constant conditional 
covariance for all the continuous variables. Their model can be viewed as a special case of 
ours in dsl), where all the $j are 0, which can be too restrictive for some applications. They 
considered both the maximum likelihood approach and maximum pseudo likelihood approach 
with a group lasso penalty for general discrete variables; when the discrete variables are all 
binary, this is equivalent to using the regular £1 penalty, which is one of the alternatives 



we compared in simulations in Section 3.2 Lee and Hastie (2012) did not focus on high- 



dimensional applications. Another recent paper, Fellinghauer et al. (2011), applied random 



forests and stability selection in fitting £i-regularized regressions of each variable on the rest, 
without specifying any generative models for the mixed data. In our framework, this would 
amount to taking the separate regression approach to a simplified conditional Gaussian model 
as in Lee and Hastie (2012), except with regression fitted using random forests coupled with 



stability selection. As our simulations show, such an approach performs well when the true 
graph is very sparse and has no complete subgraphs; when the true graph has complete 
subgraphs, our method is expected to outperform a separate regression approach. 
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